```mermaid
flowchart LR
    subgraph in["Input"]
        A["Email text"]
    end
    subgraph m["Classifier"]
        B["ML model"]
    end
    subgraph out["Decision"]
        D["Spam folder"]
        E["Inbox"]
    end
    A --> B
    B -->|Spam| D
    B -->|Not spam| E
```
Text Encoding
Introduction
Let’s assume we want to build a machine learning model that can classify emails as spam or not spam. Machine learning models need numbers as input, not text! So, we need to convert each email’s text to a vector. This process is called Text Encoding.
One-Hot Encoding
It is the simplest form of text encoding where each word gets a unique binary vector. In a vocabulary of size \(V\), each word is represented by a vector of length \(V\) with a single \(1\) at its position and \(0\)s everywhere else.
Example:
it = \([\mathbf{1}, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]\)
I = \([0, \mathbf{1}, 0, 0, 0, 0, 0, 0, 0, 0, ...]\)
the = \([0, 0, \mathbf{1}, 0, 0, 0, 0, 0, 0, 0, ...]\)
to = \([0, 0, 0, \mathbf{1}, 0, 0, 0, 0, 0, 0, ...]\)
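The encoding above can be sketched in a few lines. This is a minimal illustration with a tiny, hypothetical vocabulary; real vocabularies contain tens of thousands of words.

```python
# One-hot encoding sketch: each word in the vocabulary gets a unique index,
# and its vector is all zeros except a single 1 at that index.
vocab = ["it", "I", "the", "to"]  # hypothetical toy vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # vector of length V, all zeros
    vec[word_to_index[word]] = 1    # 1 at the word's position
    return vec

print(one_hot("the"))  # [0, 0, 1, 0]
```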
Bag of Words
Emails have variable lengths, but classifiers need fixed-size vectors. Concatenating all the one-hot vectors of a document therefore does not work: the result’s length grows with the document’s length. This is why we need Bag of Words.
Let’s focus on a single email for now. You can see the email in the image at the top of the page.
“I love this movie! It’s sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun…”
To make a bag of words, we shuffle the words in the email. You can picture this as putting the words in a bag, as seen in the image at the top of the page. Imagine you are shopping and putting items in a bag: you don’t care about the order of the items, only their count. That’s exactly what we do in making a bag of words.
You can see the bag of words in the image at the top of the page. Note that for the sake of simplicity, we have removed repeated words from the bag of words.
Term Frequency (TF)
Term Frequency (TF) is the vector of length \(V\) where each element is the count of the corresponding word in the document. When you go to the cashier to pay for the items in the bag, the cashier will count the number of items in the bag. That’s exactly what we do in making a term frequency vector.
Example: For the above bag of words, the term frequency vector is: \[ TF = [6, 5, 4, 3, 3, 2, 1, ..., 1, 0, ..., 0] \]
You can see the term frequency vector in the image at the top of the page.
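The bag-of-words count can be done with a plain counter. Below is a minimal sketch: tokenization is a naive lowercase split with punctuation stripped, which real pipelines handle far more carefully.

```python
from collections import Counter

# Term-frequency sketch: count each word, then lay the counts out in a
# fixed vocabulary order to get a fixed-size vector.
email = "I love this movie! It's sweet, but with satirical humor."
tokens = (email.lower()
               .replace("!", "").replace(",", "").replace(".", "")
               .split())
counts = Counter(tokens)                 # the "bag": word -> count

vocab = sorted(counts)                   # fixed word order for the vector
tf_vector = [counts[w] for w in vocab]   # term frequency vector
```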
Inverse Document Frequency (IDF)
Inverse Document Frequency (IDF) is a feature engineering technique invented in 1972 by Karen Spärck Jones. IDF gives more weight to rare words and less weight to common words across documents.
\[ \text{IDF}(i) = \log\left(\frac{N}{1 + DF_i}\right) \]
Where:
- \(N\) = total number of documents
- \(DF_i\) = number of documents containing word \(i\)
Imagine we have 100 emails in our email database and one of the spam emails is:
You can make a lot of money by clicking on this link.
With \(N = 100\), we can calculate the IDF for each word in the email as follows:
| Word | TF | DF | IDF |
|---|---|---|---|
| You | 1 | 90 | 0.09 |
| can | 1 | 80 | 0.21 |
| make | 1 | 70 | 0.34 |
| a | 1 | 60 | 0.49 |
| lot | 1 | 50 | 0.67 |
| of | 1 | 90 | 0.09 |
| money | 1 | 2 | 3.51 |
| by | 1 | 80 | 0.21 |
| clicking | 1 | 10 | 2.21 |
| on | 1 | 90 | 0.09 |
| this | 1 | 80 | 0.21 |
| link | 1 | 5 | 2.81 |
| . | 1 | 100 | -0.01 |
It can be seen that while words like you (IDF = 0.09), can (IDF = 0.21), and a (IDF = 0.49) appear in many emails, words like money (IDF = 3.51), clicking (IDF = 2.21), and link (IDF = 2.81) appear in only a few, which is why their IDF is high. This is not surprising: these words are more likely to be found in spam emails, so they are more informative and should carry more weight in the text encoding vector.
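The table’s IDF column can be reproduced directly from the formula. This sketch uses the document frequencies given above with \(N = 100\) emails:

```python
import math

# IDF sketch: IDF(i) = log(N / (1 + DF_i)), with the document
# frequencies taken from the table above.
N = 100
df = {"you": 90, "money": 2, "clicking": 10, "link": 5}

def idf(word):
    return math.log(N / (1 + df[word]))

print(round(idf("money"), 2))  # 3.51
print(round(idf("you"), 2))    # 0.09
```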
TF-IDF
TF-IDF combines Term Frequency and Inverse Document Frequency to place more weight on informative words and less on uninformative ones.
\[ \text{TF-IDF}_{i,j} = \log(1+\text{TF}_{i,j}) \times \text{IDF}_i \]
Where:
- \(\text{TF}_{i,j}\) = Term Frequency of word \(i\) in document \(j\)
- \(\text{IDF}_i\) = Inverse Document Frequency of word \(i\)
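Putting the two pieces together, a minimal TF-IDF sketch with the sublinear \(\log(1+\text{TF})\) term from the formula above. The TF and DF values fed in are hypothetical:

```python
import math

# TF-IDF sketch: sublinear term frequency log(1 + TF) times IDF.
def tf_idf(tf, df, n_docs):
    return math.log(1 + tf) * math.log(n_docs / (1 + df))

# e.g. "money" appearing twice in one email, in 2 of 100 emails overall
score = tf_idf(tf=2, df=2, n_docs=100)

# a common word appearing twice gets a much lower score
common = tf_idf(tf=2, df=90, n_docs=100)
```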
BM25
Okapi BM25 grew out of the probabilistic retrieval framework developed by Stephen Robertson, Karen Spärck Jones, and colleagues starting in the 1970s. It’s a ranking function that estimates how relevant a document is to a search query.
Given a query Q containing keywords q₁ to qₙ: \[ \text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} \]
Where:
- \(D\) = Document
- \(Q\) = Query
- \(n\) = Number of keywords in the query
- \(f(q_i, D)\) = Frequency of keyword \(q_i\) in document \(D\)
- \(k_1\) = Smoothing parameter
- \(b\) = Length normalization parameter
- \(\text{avgdl}\) = Average document length
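The scoring function above can be sketched as follows. This is a minimal illustration with the commonly used defaults \(k_1 = 1.5\) and \(b = 0.75\); note that the IDF here reuses the formula from the IDF section, while production BM25 implementations typically use a slightly different smoothed IDF. Documents and query are hypothetical token lists.

```python
import math

# BM25 scoring sketch: sum, over the query keywords, of IDF-weighted,
# saturated and length-normalized term frequencies.
def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n        # average document length
    score = 0.0
    for q in query:
        df = sum(1 for d in docs if q in d)      # document frequency of q
        idf = math.log(n / (1 + df))             # IDF as defined earlier
        f = doc.count(q)                         # frequency of q in doc
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

docs = [["click", "this", "link", "for", "money"],
        ["meeting", "agenda", "for", "monday"],
        ["your", "money", "is", "safe"]]
s = bm25_score(["money", "link"], docs[0], docs)
```

A document containing the query keywords scores higher than one without them, which is what makes the function usable for ranking search results.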